The impact of genotyping strategies and statistical models on accuracy of genomic prediction for survival in pigs

Background Survival from birth to slaughter is an important economic trait in commercial pig production. Increasing survival can improve both economic efficiency and animal welfare. The aim of this study is to explore the impact of genotyping strategies and statistical models on the accuracy of genomic prediction for survival in pigs during the total growing period from birth to slaughter. Results We simulated pig populations with different direct and maternal heritabilities and used a linear mixed model, a logit model, and a probit model to predict genomic breeding values of pig survival based on individual survival records with binary outcomes (0, 1). The results show that when only alive animals have genotype data, unbiased genomic predictions can be achieved by using variances estimated from a pedigree-based model. Models using genomic information achieved up to 59.2% higher accuracy of estimated breeding values than the pedigree-based model, depending on the genotyping scenario. The scenario of genotyping all individuals, both dead and alive, obtained the highest accuracy. When an equal number of individuals (80%) were genotyped, genotyping a random sample of individuals achieved higher accuracy than genotyping only alive individuals. The linear model, logit model and probit model achieved similar accuracy. Conclusions We conclude that genomic prediction of pig survival is feasible when only alive pigs have genotypes, but genomic information of dead individuals can increase the accuracy of genomic prediction by 2.06% to 6.04%. Supplementary Information The online version contains supplementary material available at 10.1186/s40104-022-00800-5.


Background
Survival from birth to slaughter is an important economic trait in commercial pig production. Increased survival also improves the welfare of pigs. According to productivity data, the cumulative survival rate from birth to slaughter is lower than 70% [1], and in addition there has been a downward trend in piglet pre-weaning survival over the past ten years [2]. Use of genomic information in the selection program would be a sustainable and effective way to reduce pig mortality. As a powerful genetic improvement tool, genomic selection has been widely used in animal breeding, for example in cattle [3][4][5], pigs [6][7][8], and chickens [9][10][11]. Genomic selection is especially beneficial for traits with low heritability, which show slow genetic progress under traditional pedigree-based methods [12][13][14]. Guo et al. [15] studied the accuracy of estimated breeding values for piglet survival rate from birth to day 5 and reported that the accuracy of the single-step method was higher than that of the pedigree-based method by 14.2% for Landrace and by 7.2% for Yorkshire. In a crossbred pig population, Leite et al. [16] compared the accuracies of the estimated breeding values of mortality at five stages from birth to slaughter, and reported that the accuracy of the single-step method was 16.7%-78.9% higher than that of the pedigree-based method, with the largest improvement for lactation mortality and the smallest improvement for postweaning mortality.
*Correspondence: guosheng.su@qgg.au.dk
Usually, like litter size, piglet survival is recorded as a trait of the sow or the service sire [15,16]. However, survival is a complex trait that is also affected by the pig's own genotype. It may therefore be more appropriate to assess the genetic merit of survival at the individual level [17]. Evaluating survival at the individual level, however, raises a problem with genotyping strategies because dead individuals generally do not have genotypes. Using only the genotype data of alive individuals may lead to biased genomic predictions. The influence of the genotypes of dead individuals on the accuracy and unbiasedness of genomic prediction therefore needs to be studied. Finally, survival at the individual level is a binary trait that does not follow a normal distribution, and thus conventional statistical analysis methods may not be suitable [18]. Therefore, when estimating breeding values, a logit model or a liability threshold model could be more appropriate. However, Koeck et al. [19] evaluated the performance of a linear model and a logit model for genetic analyses of clinical mastitis in Austrian Fleckvieh dual purpose cows and found no difference in predictive ability between the two models. In a Norwegian Red cow population, Vazquez et al. [20] also compared a liability threshold model with a linear model for genetic evaluation of clinical mastitis, and again found no difference in the predictive ability of the two models. It is therefore necessary to investigate whether a logit or a liability threshold model is better than a linear model for predicting breeding values of survival in pig populations.
We hypothesized that different genotyping strategies affect the accuracy and unbiasedness of breeding value estimation. Furthermore, we hypothesized that logit or liability threshold models are more suitable than a linear model for predicting threshold traits, both with and without genomic information. Therefore, this study has two objectives: (1) to explore the impact of genotyping scenarios, especially the absence of genotypes for dead individuals, on genomic prediction of mortality; and (2) to assess linear versus logit and liability threshold models in the estimation of breeding values.

Data simulation
The data were simulated using QMSim software [21], mimicking a pig population. We simulated 18 chromosomes; each chromosome was 100 cM long and carried 3100 markers and 50 QTLs. The QTL effects were assumed to follow a normal distribution. The simulation started with a founder population of 200 males and 200 females and went through 300 non-overlapping historical generations to generate linkage disequilibrium between markers and QTLs. In total, about 45,000 markers and 730 QTLs were segregating in the genome in the last historical generation, with slight differences in the numbers of markers and QTLs between replicates. After the historical generations, 30 boars randomly selected from the last historical generation and all 200 sows in that generation were used to create a base population. The population then went through eight non-overlapping generations. In each generation, 30 sires and 300 dams were randomly selected from the alive animals (see below for how survival/death of animals was simulated), each sire was mated to 10 randomly chosen dams, and each dam produced one litter. The litter sizes were 10, 12, 14, 16, or 18 with probabilities 0.02, 0.14, 0.68, 0.14, and 0.02, respectively, and the sex ratio of piglets was 1:1. The data from generations 5 to 8 were used in the analysis.
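The litter-size and sex-ratio settings described above can be illustrated with a small sketch. This is not the QMSim configuration used in the study; the function names, the seed, and the use of NumPy are assumptions made only for illustration. With the stated probabilities, the expected litter size is 14, so 300 dams give about 4200 piglets per generation.

```python
import numpy as np

# Litter-size distribution stated in the text.
LITTER_SIZES = np.array([10, 12, 14, 16, 18])
LITTER_PROBS = np.array([0.02, 0.14, 0.68, 0.14, 0.02])

def sample_litter_sizes(n_dams, rng):
    """Draw one litter size per dam from the stated discrete distribution."""
    return rng.choice(LITTER_SIZES, size=n_dams, p=LITTER_PROBS)

def sample_sexes(n_piglets, rng):
    """Assign sex with probability 0.5 each, giving an expected 1:1 ratio."""
    return rng.choice(np.array(["M", "F"]), size=n_piglets)

rng = np.random.default_rng(2022)         # arbitrary seed (assumption)
litters = sample_litter_sizes(300, rng)   # 300 dams per generation
sexes = sample_sexes(int(litters.sum()), rng)

# Expected piglets per litter under the stated probabilities.
expected_mean = float(LITTER_SIZES @ LITTER_PROBS)
print(expected_mean, int(litters.sum()))
```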
The phenotypic liability of an individual to be alive was generated as the sum of the direct additive genetic effect of the individual, the maternal additive genetic effect of the dam, the litter effect and a random residual. Fixed effects (such as herd-year-month) were not considered. Three survival traits with different variances and covariances were simulated, i.e., direct heritability and maternal heritability were set to 0.04 and 0.04 (T 4/4), 0.02 and 0.04 (T 2/4), or 0.02 and 0.02 (T 2/2), respectively. The genetic correlation between direct and maternal additive genetic effects was 0.30. The variance of the litter effect was the same as the maternal additive genetic variance. The direct and maternal QTL allele effects were sampled from a bivariate normal distribution with the specified correlation. The true breeding values (TBVs) for the direct and maternal additive genetic effects were defined as the sum of the QTL allele effects, and these TBVs were scaled to have the designed variances [22]. The other random effects were sampled from normal distributions with the corresponding variances. The phenotype on the observed scale was scored as 1 if the liability to survive was in the top 80%, and 0 otherwise, i.e., the mortality rate was 20%. Each of the three traits was simulated with 40 replicates.
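The liability-threshold step above can be sketched as follows. This is a simplified illustration, not the QTL-based simulation of the study: breeding values are drawn directly from normal distributions, one generation of unrelated and non-inbred parents is assumed, litter size is fixed at 14, and the residual variance is chosen so that the total liability variance equals 1. All of these choices, and the function name, are assumptions.

```python
import numpy as np

def simulate_survival(h2_a=0.04, h2_m=0.04, r_am=0.30, survival_rate=0.80,
                      n_sires=30, dams_per_sire=10, litter_size=14, seed=1):
    """Simulate liabilities and binary survival (1 = alive, 0 = dead)."""
    rng = np.random.default_rng(seed)
    var_a, var_m = h2_a, h2_m            # total liability variance fixed at 1 (assumption)
    var_l = var_m                        # litter variance equals maternal variance (from the text)
    cov_am = r_am * np.sqrt(var_a * var_m)
    # Residual variance so that all components sum to 1 (assumption);
    # cov(offspring a, dam m) = 0.5 * cov_am enters the total twice.
    var_e = 1.0 - var_a - var_m - var_l - cov_am

    n_dams = n_sires * dams_per_sire
    sire_a = rng.normal(0.0, np.sqrt(var_a), n_sires)
    dam_am = rng.multivariate_normal(
        [0.0, 0.0], [[var_a, cov_am], [cov_am, var_m]], n_dams)

    dam_idx = np.repeat(np.arange(n_dams), litter_size)
    sire_idx = dam_idx // dams_per_sire
    n = dam_idx.size

    # Offspring direct effect: parent average plus Mendelian sampling.
    a = 0.5 * (sire_a[sire_idx] + dam_am[dam_idx, 0]) \
        + rng.normal(0.0, np.sqrt(0.5 * var_a), n)
    m_dam = dam_am[dam_idx, 1]
    litter = rng.normal(0.0, np.sqrt(var_l), n_dams)[dam_idx]
    e = rng.normal(0.0, np.sqrt(var_e), n)

    liability = a + m_dam + litter + e
    # Animals in the top 80% of liability are scored as alive.
    threshold = np.quantile(liability, 1.0 - survival_rate)
    alive = (liability > threshold).astype(int)
    return liability, alive

liability, alive = simulate_survival()
print(alive.mean())
```

Using an empirical quantile as the threshold fixes the realized mortality at the target rate within each simulated data set, which mirrors the "top 80%" wording of the text.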

Statistical analysis
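A linear, a logit and a probit model (i.e., a liability threshold model) were used for estimation of genetic parameters and breeding values. The linear model was as follows:

$$\mathbf{y} = \mathbf{1}\mu + \mathbf{W}_l\mathbf{l} + \mathbf{Z}_a\mathbf{a} + \mathbf{Z}_m\mathbf{m} + \mathbf{e}$$

where $\mathbf{y}$ is the vector of binary observations of pig survival with 0 and 1 representing dead and alive, respectively; $\mu$ is the overall mean; $\mathbf{1}$ is the vector of ones; $\mathbf{l}$ is the vector of litter effects; $\mathbf{a}$ is the vector of direct additive genetic effects; $\mathbf{m}$ is the vector of maternal additive genetic effects; and $\mathbf{e}$ is the vector of residual effects. The matrices $\mathbf{W}_l$, $\mathbf{Z}_a$ and $\mathbf{Z}_m$ are incidence matrices associating $\mathbf{l}$, $\mathbf{a}$ and $\mathbf{m}$ with $\mathbf{y}$. In the model, direct and maternal additive genetic effects are correlated, and the other effects are independent of each other. Thus, it is assumed that $\mathbf{l}$, $\mathbf{e}$, $\mathbf{a}$ and $\mathbf{m}$ have the following distributions:

$$\mathbf{l} \sim N(\mathbf{0}, \mathbf{I}\sigma_l^2), \quad \mathbf{e} \sim N(\mathbf{0}, \mathbf{I}\sigma_e^2), \quad \begin{bmatrix} \mathbf{a} \\ \mathbf{m} \end{bmatrix} \sim N\left(\mathbf{0}, \begin{bmatrix} \sigma_a^2 & \sigma_{am} \\ \sigma_{am} & \sigma_m^2 \end{bmatrix} \otimes \mathbf{K}\right)$$

where $\sigma_l^2$, $\sigma_e^2$, $\sigma_a^2$, $\sigma_m^2$ and $\sigma_{am}$ are the litter variance, residual variance, direct additive genetic variance, maternal additive genetic variance, and covariance between direct and maternal additive genetic effects, respectively, and $\mathbf{K}$ is an additive genetic relationship matrix based on pedigree and/or genomic information. When using the pedigree-based method for the scenario of no genotyping, $\mathbf{K}$ was constructed from pedigree information [23]. When using the single-step GBLUP model (ssGBLUP), $\mathbf{K}$ represents the $\mathbf{H}$ matrix constructed from pedigree and genomic information [24]. In the scenario where all animals are genotyped, $\mathbf{K} = \mathbf{G}_\omega$. The $\mathbf{H}$ matrix is as follows:

$$\mathbf{H} = \begin{bmatrix} \mathbf{G}_\omega & \mathbf{G}_\omega\mathbf{A}_{11}^{-1}\mathbf{A}_{12} \\ \mathbf{A}_{21}\mathbf{A}_{11}^{-1}\mathbf{G}_\omega & \mathbf{A}_{21}\mathbf{A}_{11}^{-1}\mathbf{G}_\omega\mathbf{A}_{11}^{-1}\mathbf{A}_{12} + \mathbf{A}_{22} - \mathbf{A}_{21}\mathbf{A}_{11}^{-1}\mathbf{A}_{12} \end{bmatrix}$$

where $\mathbf{A}_{11}$ and $\mathbf{A}_{22}$ are the sub-matrices of the pedigree-based relationship matrix ($\mathbf{A}$) for relationships between genotyped individuals and between non-genotyped individuals, respectively, $\mathbf{A}_{12}$ and $\mathbf{A}_{21}$ are the sub-matrices for relationships between genotyped and non-genotyped individuals, and $\mathbf{G}_\omega = (1 - \omega)\mathbf{G}^* + \omega\mathbf{A}_{11}$. In this study, $\omega$ was set to 0.2. $\mathbf{G}$ is the marker-based genomic relationship matrix [25], and $\mathbf{G}^* = \mathbf{G}\beta + \alpha$ is the adjusted $\mathbf{G}$, where $\beta$ and $\alpha$ are obtained by solving the following equations [8]:

$$\text{Avg.diag}(\mathbf{G})\beta + \alpha = \text{Avg.diag}(\mathbf{A}_{11})$$
$$\text{Avg.offdiag}(\mathbf{G})\beta + \alpha = \text{Avg.offdiag}(\mathbf{A}_{11})$$

The logit model and the probit model (also called liability threshold model) are described as:

$$\boldsymbol{\eta} = \mathbf{1}\mu + \mathbf{W}_l\mathbf{l} + \mathbf{Z}_a\mathbf{a} + \mathbf{Z}_m\mathbf{m}$$

For the logit model (LG), $\boldsymbol{\eta}$ is the vector of log-odds of the expected pig survival, i.e., $\eta_i = \log\left(p_i / (1 - p_i)\right)$, where $p_i$ is the expected survival probability of individual $i$. For the probit model, $\eta_i = \Phi^{-1}(p_i)$, where $\Phi^{-1}$ is the inverse cumulative standard normal distribution function. The vectors $\mathbf{l}$, $\mathbf{a}$, $\mathbf{m}$, the mean $\mu$, and the matrices $\mathbf{W}_l$, $\mathbf{Z}_a$, $\mathbf{Z}_m$ are defined as in the linear model.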
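The adjustment of G to the scale of A 11 and the blending G ω = (1 − ω)G* + ωA 11 can be sketched numerically. The sketch below assumes that G is built with VanRaden's first method and that the adjustment is G* = βG + α with β and α solved from the average-diagonal and average-off-diagonal equations. The function names, the toy genotypes and the identity A 11 are hypothetical and serve only as an illustration.

```python
import numpy as np

def vanraden_g(M):
    """Genomic relationship matrix from 0/1/2 genotypes.

    Assumes VanRaden's method 1 with allele frequencies taken from M itself.
    """
    p = M.mean(axis=0) / 2.0
    Z = M - 2.0 * p
    return Z @ Z.T / (2.0 * np.sum(p * (1.0 - p)))

def adjust_and_blend(G, A11, omega=0.2):
    """Scale G to match A11 on average, then blend with weight omega on A11."""
    n = G.shape[0]
    off = ~np.eye(n, dtype=bool)          # mask of off-diagonal elements
    # Two equations in (beta, alpha):
    #   mean(diag(G))    * beta + alpha = mean(diag(A11))
    #   mean(offdiag(G)) * beta + alpha = mean(offdiag(A11))
    lhs = np.array([[np.mean(np.diag(G)), 1.0],
                    [np.mean(G[off]), 1.0]])
    rhs = np.array([np.mean(np.diag(A11)), np.mean(A11[off])])
    beta, alpha = np.linalg.solve(lhs, rhs)
    G_star = beta * G + alpha
    G_omega = (1.0 - omega) * G_star + omega * A11
    return G_star, G_omega, beta, alpha

# Toy data: 50 animals, 500 markers, treated as unrelated in the pedigree.
rng = np.random.default_rng(0)
M = rng.binomial(2, 0.3, size=(50, 500)).astype(float)
A11 = np.eye(50)                          # hypothetical pedigree relationships
G = vanraden_g(M)
G_star, G_omega, beta, alpha = adjust_and_blend(G, A11, omega=0.2)
print(beta, alpha)
```

Blending with A 11 also helps to keep G ω invertible, which matters because the single-step equations require its inverse.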
The variance components were estimated using the AI-REML method [26]. The AI-REML procedure did not converge for some ssGBLUP models. Therefore, variance components estimated from pedigree-based models were used for estimation of breeding values in all models. The estimation of variance components and breeding values was performed using the DMU software [27].

Validation of genomic predictions
To validate genomic prediction, generations 5 to 7 were used as the reference population, and generation 8 was used as the validation population. Genomic predictions were evaluated using the following criteria: 1) The correlation between the estimated breeding value (EBV) and the true breeding value (TBV, i.e., a, m or a + m on the liability scale in the simulation), to assess the accuracy of genomic prediction; 2) The average true breeding value of the top 1% and the top 30% of all individuals ranked on EBV, to assess the realized selection differential, where 1% can be considered the selection intensity for boars and 30% that for sows; 3) The regression of the EBV from the whole data with genotypes of all animals on the EBV from the reference data for each genotyping scenario, similar to Legarra and Reverter's study [28], to evaluate the dispersion bias of a particular model and genotyping scenario. Note that dispersion bias was assessed by comparison with the EBV obtained using full data information instead of the true breeding value. The reason was that the true BV in the simulation was the BV of liability, whereas the EBV from the linear model was on the observed scale and the EBV from the logit model was on the logit scale. Even for the probit model, the scale of the EBV differed from that of the simulated TBV, because the residual variance is restricted to 1 in the probit model. Thus, the expected regression of true BV on EBV was not equal to one even in the case of unbiased prediction. A paired t-test was used to test the differences between accuracies of EBV from the four genotyping strategies and from the three models.
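The three validation criteria can be written as short functions. This is a minimal sketch with hypothetical function names and toy data; it illustrates the definitions given above rather than reproducing the analysis pipeline of the study.

```python
import numpy as np

def accuracy(ebv, tbv):
    """Criterion 1: correlation between EBV and TBV."""
    return float(np.corrcoef(ebv, tbv)[0, 1])

def mean_tbv_of_top(ebv, tbv, fraction):
    """Criterion 2: mean TBV of the top `fraction` of animals ranked on EBV."""
    n_top = max(1, int(round(fraction * len(ebv))))
    top = np.argsort(ebv)[::-1][:n_top]   # indices of the highest EBVs
    return float(np.mean(tbv[top]))

def dispersion_slope(ebv_whole, ebv_partial):
    """Criterion 3: slope of the regression of whole-data EBV on partial-data EBV.

    A slope close to 1 indicates no dispersion bias.
    """
    cov = np.cov(ebv_whole, ebv_partial)
    return float(cov[0, 1] / cov[1, 1])

# Toy example with made-up values for 1000 validation animals.
rng = np.random.default_rng(0)
tbv = rng.normal(size=1000)
ebv_partial = 0.4 * tbv + rng.normal(scale=0.5, size=1000)
ebv_whole = ebv_partial + rng.normal(scale=0.1, size=1000)

print(accuracy(ebv_partial, tbv))
print(mean_tbv_of_top(ebv_partial, tbv, 0.01), mean_tbv_of_top(ebv_partial, tbv, 0.30))
print(dispersion_slope(ebv_whole, ebv_partial))
```

Because the slope compares two sets of EBV on the same scale, it stays interpretable for the linear, logit and probit models alike, which is the motivation given in the text for not regressing TBV on EBV.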

Results
The variance components estimated from the model with the pedigree-based relationship matrix were used for estimation of breeding values. Heritabilities estimated using pedigree information are shown in Table 1. Proportions of variances and heritabilities differed among the three models because of the different scales. For traits T 4/4 and T 2/2, when using the logit model and the probit model, the estimated direct heritability ranged from 0.011 to 0.022 and was lower than the estimated maternal heritability, which ranged from 0.019 to 0.039. This was unexpected since direct and maternal heritabilities were the same in the simulation for the two traits. For the three models, the estimates of the correlation between the direct and maternal additive effects ranged from 0.286 to 0.523 and had large standard errors.

Table 1 Estimates of proportion of litter variance (lit²), direct heritability ($h_a^2$), maternal heritability ($h_m^2$), and correlation between direct and maternal additive genetic effects ($r_{am}$) using models incorporating the pedigree-based relationship matrix

Accuracies of EBV were measured as correlation coefficients between EBV and TBV. Accuracies of estimated direct (a), maternal (m) and total (a + m) breeding values are shown in Table 2. Models using genomic information achieved up to 59.2% higher accuracy of estimated breeding values than models using pedigree information, depending on the genotyping scenario. Accuracies of EBV for a from the three models using only the pedigree-based relationship matrix (scenario G_none) ranged from 0.287 to 0.288 for trait T 4/4, 0.242 to 0.245 for T 2/4 and 0.224 to 0.226 for T 2/2. When using genomic data, across the three scenarios (G_all, G80_ran, G_alive), the accuracies ranged from 0.375 to 0.459 for T 4/4, 0.293 to 0.352 for T 2/4 and 0.286 to 0.340 for T 2/2. Accuracies of EBV for the maternal effect (m) using only the pedigree-based relationship matrix ranged from 0.247 to 0.251 for trait T 4/4, 0.264 to 0.270 for T 2/4 and 0.196 to 0.197 for T 2/2. When using genomic data, across all scenarios, the accuracies for the maternal effect ranged from 0.385 to 0.409 for T 4/4, 0.397 to 0.418 for T 2/4 and 0.310 to 0.325 for T 2/2. Accuracies of EBV for the total genetic effect (a + m) using pedigree-based models without genomic information ranged from 0.314 to 0.315 for trait T 4/4, 0.310 to 0.311 for T 2/4 and were 0.249 for T 2/2. The accuracies of total EBV for the scenarios with genomic data are given in Table 2.

As expected, for the three types of EBV (a, m, and a + m), the scenario in which all individuals, including dead individuals, were genotyped (G_all) had the highest accuracy. The composition of the genotyped individuals affected the accuracies of EBV for a and a + m, but not for m. In scenario G_alive, the accuracies of EBV for a were 0.375 to 0.378 for trait T 4/4, 0.293 to 0.299 for T 2/4 and 0.286 to 0.288 for T 2/2. With the same number of genotyped pigs, the accuracies in G80_ran were higher than those in G_alive by 12.70% to 13.76% for trait T 4/4, 10.92% to 12.20% for T 2/4 and 10.14% to 11.46% for T 2/2. The trend of accuracies for a + m was the same as that for a. The accuracies of EBV for a + m in G_alive were 0.447 to 0.449 for trait T 4/4, 0.428 to 0.429 for T 2/4 and 0.359 to 0.360 for T 2/2, and the accuracies in G80_ran were higher than those in G_alive by 5.35% to 6.04% for trait T 4/4, 2.56% to 2.57% for T 2/4 and 3.06% to 3.34% for T 2/2. However, the trend of accuracies for m differed from those for a and a + m in terms of the composition of genotyped individuals. The accuracies of EBV for m in G80_ran were similar to those in G_alive, and the differences among them were less than 0.01 for the three traits (P < 0.05).
As shown in Table 2, accuracies from the linear model were very similar to those from the logit and probit models for the three types of EBV, and the differences among them were less than 0.01 for the three traits. The differences in accuracy for a ranged from 0 to 0.008 for trait T 4/4, 0 to 0.008 for T 2/4 and 0 to 0.007 for T 2/2. The differences in accuracy for m ranged from 0 to 0.008 for trait T 4/4, 0.001 to 0.006 for T 2/4 and 0 to 0.001 for T 2/2. The differences in accuracy for a + m ranged from 0 to 0.002 for trait T 4/4, 0 to 0.001 for T 2/4 and 0 to 0.001 for T 2/2.
In scenarios G80_ran and G_alive, 20% of the animals did not have genotype data. Additional file 1: Table S1 shows that the accuracies for genotyped individuals were higher than those for non-genotyped pigs. The differences in accuracy for a ranged from 0.077 to 0.093 for trait T 4/4, 0.037 to 0.046 for T 2/4 and 0.061 to 0.072 for T 2/2. The differences in accuracy for m ranged from 0.058 to 0.090 for trait T 4/4, 0.053 to 0.074 for T 2/4 and 0.058 to 0.087 for T 2/2. The differences in accuracy for the total EBV ranged from 0.094 to 0.109 for trait T 4/4, 0.068 to 0.086 for T 2/4 and 0.079 to 0.094 for T 2/2. In addition, the accuracies of the three types of EBV for non-genotyped animals (Additional file 1: Table S1) were higher than those for animals in the scenario without any genotype information (Table 2, G_none).
The regression coefficients of the EBV from the whole data with all animals having genotypes on the EBV from the different reference data are presented in Table 3. The regression coefficients of direct EBV (a) ranged between 1.046 and 1.132 for T 4/4, 1.001 and 1.126 for T 2/4, and 0.944 and 1.019 for T 2/2. The regression coefficients of maternal EBV (m) ranged between 0.895 and 0.938 for T 4/4, 1.057 and 1.085 for T 2/4, and 1.000 and 1.043 for T 2/2. The regression coefficients of the total EBV (a + m) are also given in Table 3.

Table 3 Regression coefficient of the EBV from whole data on the EBV from reference data

Table 4 shows the mean total TBV of the top 1% of individuals with the highest total EBV. The higher the accuracy of EBV for a + m (Table 2), the higher the TBV. For trait T 4/4, the scenario with all individuals genotyped obtained the highest TBV for a + m (4.498 to 4.553), followed by scenario G80_ran (4.297 to 4.346), then scenario G_alive (4.221 to 4.308), and the lowest was scenario G_none (2.583 to 2.712). The order of TBV for a + m among the four scenarios was the same for the other two traits, T 2/4 and T 2/2. The order of TBV for a was the same as that for a + m, but the order for m was not: the order of scenarios G80_ran and G_alive was reversed, with G_alive higher than G80_ran for T 4/4 and T 2/2. When using genomic data, TBVs for a from the linear model were higher than those from the logit and probit models. However, using pedigree-based models without genomic information, TBVs for a from the linear model were lower than those from the logit and probit models. With or without genomic information, TBVs for the maternal effect (m) from the linear model were lower than those from the logit and probit models for all traits.

Table 5 shows the mean total TBV of the top 30% of individuals with the highest total EBV. For all traits, the order of the four scenarios for total TBV of the top 30% of individuals was consistent with that of the top 1% of individuals, i.e., scenario G_all obtained the highest TBV, followed by scenario G80_ran, then scenario G_alive, and the lowest was scenario G_none. In the four scenarios, the linear model outperformed the logit and probit models for a, but not for m.

Discussion
In this study, we compared four genotyping strategies and three prediction models when predicting breeding values for three pig survival traits with different direct and maternal heritabilities. When using variance components estimated from a pedigree-based model, genomic predictions were unbiased with respect to dispersion of predictions, even for the scenario with genotypes only from alive animals. Randomly genotyping individuals led to higher prediction accuracy than genotyping only alive individuals, given the same number of genotyped animals. The linear model achieved genomic prediction ability similar to that of the logit and probit models.
In the current study, variance components were estimated from a pedigree-based model and these estimates were used for predicting breeding values in all genotyping scenarios. It has been reported that when selection is based on genomic information, genetic parameters estimated without this information can be biased [29]. Similarly, when selection is based on pedigree information, genetic parameters estimated using an ssGBLUP model can also be biased [30]. However, the impact of selection on variance component estimates was not an issue in the current study, because the simulated population was under random selection. On the other hand, the current study involved the issue of selective genotyping. In a pig breeding program, dead animals are usually not genotyped, which may lead to biased estimation of variance components and biased genomic prediction when using a genomic model for parameter estimation. We carried out an extra simulation study using models with genomic data and found that parameter estimation using an ssGBLUP model with genotypes only from alive animals severely overestimated additive genetic variance and led to a residual variance close to zero (Additional file 1: Table S3). Similarly, Wang et al. [31] reported that selective genotyping led to severe overestimation of additive genetic variance when using an ssGBLUP model. Because of the problems with convergence and biased estimation of variance components in some scenarios, variances estimated from pedigree-based models were used for predicting breeding values in the current study. Because the estimates from the three models are on different scales, they cannot be directly compared. After a transformation from observed scale heritability to liability scale heritability [32], the liability scale heritabilities estimated from the linear model were consistent with those used in simulating the data. However, the logit and probit models underestimated direct heritabilities and overestimated the correlation between direct and maternal additive genetic effects. A possible reason is that including the maternal additive genetic effect increases model complexity, and it is difficult to distinguish direct and maternal additive genetic effects, as reflected by the large standard errors of the estimated correlation between direct and maternal additive genetic effects in this study. The logit and probit animal models could be more sensitive to model complexity than the linear animal model. This could also be the reason that the logit and probit models did not predict better than the linear model in the current study, although the two models are more appropriate in theory.

Table 4 The mean of true breeding value of the top 1% of animals with the highest total estimated breeding value
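The transformation from observed scale to liability scale heritability is cited but not written out in the text. Assuming it is the classical Dempster and Lerner formula, h² on the liability scale equals h² on the observed scale multiplied by p(1 − p)/z², where p is the incidence and z is the height of the standard normal density at the threshold. The function name below is hypothetical. With a survival rate of 0.80 the multiplier is roughly 2.04, so a liability scale heritability of 0.04 corresponds to an observed scale heritability of about 0.02.

```python
from statistics import NormalDist

def observed_to_liability_h2(h2_obs, incidence):
    """Convert observed-scale heritability of a binary trait to the liability scale.

    Assumes the Dempster-Lerner relation: h2_liab = h2_obs * p * (1 - p) / z**2,
    with z the standard normal density at the threshold for incidence p.
    """
    std_normal = NormalDist()
    threshold = std_normal.inv_cdf(1.0 - incidence)
    z = std_normal.pdf(threshold)
    return h2_obs * incidence * (1.0 - incidence) / z ** 2

# Multiplier for a survival rate of 80% (mortality 20%).
factor = observed_to_liability_h2(1.0, 0.80)
print(factor)

# Observed-scale value that maps to the simulated liability heritability of 0.04.
h2_obs_equivalent = 0.04 / factor
print(h2_obs_equivalent)
```

The multiplier is symmetric in p and 1 − p, so it does not matter whether the trait is coded as survival or as mortality.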
In this study, we compared the accuracies of total EBV from four genotyping strategies for three traits. The three strategies using genomic information gave higher accuracies of total EBV than the strategy using only pedigree information, and the accuracies for genotyped individuals were higher than those for non-genotyped individuals within the same strategy. Furthermore, since non-genotyped animals benefit from the genomic information of other animals, the accuracies for non-genotyped individuals in scenarios G80_ran or G_alive were higher than those for individuals in scenario G_none. These results are consistent with a previous study of piglet mortality using an ssGBLUP method in Danish Landrace and Yorkshire pigs [15]. Among the three strategies using genomic information, the strategy of genotyping all individuals in the reference population gave higher accuracies of total EBV than the strategies of genotyping only some individuals, which is consistent with theoretical expectations [33]. Moreover, with the same number of genotyped individuals, genotyping both alive and dead pigs gave a higher accuracy than genotyping only alive pigs, indicating that the genotypes of dead pigs have an important influence on the accuracy of genomic prediction. Therefore, it could be a good strategy to genotype dead animals. In the current study, genetic values were generated from 730 QTLs for which the direct and maternal additive genetic effects followed a bivariate normal distribution, since previous studies [34] have revealed that pig mortality is a complex trait with a polygenic genetic architecture. If pig mortality were controlled by a small number of genes, the frequency of unfavorable alleles would differ greatly between dead and alive animals, implying an even greater need to genotype dead animals for genomic prediction of pig mortality. A study based on real data of pig mortality would be of great importance; however, genotype data of dead pigs are currently not available in pig breeding programs. As expected, the trait with higher heritability had higher prediction accuracy. Further, with the same heritability for direct and maternal additive genetic effects in traits T 4/4 and T 2/2, accuracies of direct EBV (a) were higher than those of maternal EBV (m) for scenarios G_all, G80_ran, and G_none, indicating that the maternal genetic effect is more difficult to estimate in general (Table 1). However, accuracies of maternal EBV were higher than those of direct EBV in scenario G_alive and were similar to those in scenario G80_ran, suggesting that selectively genotyping alive animals has a small impact on prediction accuracy for the maternal additive genetic effect, but a large impact on prediction accuracy for the direct additive genetic effect.
We compared the accuracy of genomic prediction of a linear model, a logit model and a probit model for survival in pigs. Using pedigree information, accuracies of total EBV were very similar among the three models; the differences were less than 1% for all three traits (T 4/4, T 2/4 and T 2/2). Previous studies have shown that linear, logit and probit models have similar predictive ability for threshold traits [19,20,36]. In a simulation study, Carlén et al. [36] showed that the predictive ability of linear and threshold models was very similar for mastitis, defined as a binary trait, in dairy cattle. Koeck et al. [19] evaluated the performance of a linear, a logit and a probit model for genetic analyses of clinical mastitis in Austrian Fleckvieh dual purpose cows and showed that there were very small differences in predictive ability among the three models. In a Norwegian Red cow population, Vazquez et al. [20] also observed similar results when comparing the genetic predictive ability of threshold and linear models for clinical mastitis. Using genomic information, accuracies of total EBV were higher than those using only pedigree information but, as with pedigree-based prediction, accuracies were very similar among the linear, logit and probit (liability threshold) models for all three traits in the current study. Although the logit and probit models were hypothesized to be more suitable for threshold traits, the results indicate that the predictive ability of the linear, logit and probit models is similar in genomic prediction for survival traits.

Conclusions
In this study, three survival traits with different heritabilities were simulated to explore the impact of genotyping strategies and statistical models on genomic prediction. The results showed that genomic predictions with genotypes only from alive animals were unbiased when using variance components estimated from a pedigree-based model. Randomly genotyping individuals gave higher accuracy than genotyping only alive individuals, given the same number of genotyped individuals. The predictive abilities of the linear, logit and probit models were similar. We conclude that the genomic information of dead individuals is very useful, and that the linear model is a good choice for genomic prediction of survival in pigs. We recommend using variances estimated from a pedigree-based model for genomic prediction in the case of selective genotyping.
Additional file 1: Table S1. Correlation coefficient between the EBV and true breeding values for validation individuals with or without genotypes. Table S2. Regression coefficient of the EBV from whole data on the EBV from reference data for validation individuals with or without genotype. Table S3. Estimates of variances and heritability using a linear model without maternal additive genetic effect for the trait T 4/4 .